W2 Lab Assignment

The Internet Movie Database (IMDb) provides a wide range of information about movies, such as total budgets, lengths, actors, and user ratings. The data are publicly available from IMDb. In this lab, let's explore a processed dataset named 'imdb.csv', which contains some basic information about movies.

Download the file from Canvas. There are 4 columns separated by tab:

  1. Title: title of the movie;
  2. Year: release year;
  3. Rating: average IMDb user rating;
  4. Votes: number of IMDb users who rated this movie.

First, we want to get some insights from the data with Python; then we want to display the information on a web page and prettify it with HTML/CSS.

Things to note:

  1. Let's use Python 3.5;
  2. There are 313,012 lines in the file. When printing things, print selectively.

Part 1. Data manipulation with Python

Q1: What are the first and last years in this dataset? How many movies were released in each year?

To do this, we first need to read the CSV file. Python provides the csv module to read and write CSV files. The csv.reader function returns an object that iterates over the lines of the given file. Each line is returned as a list of strings, so we can access a particular column by its list index. To ignore the first line (the header), we can use islice from the itertools module. It works like slicing a list, but it can slice any iterator (e.g. a file stream). For instance, islice(reader, 0, 5) means "give me the first 5 items from the reader", and islice(reader, 1, 5) means "give me the 4 items starting from the second item".

A basic usage example that reads the first 5 lines of 'imdb.csv' (the header plus four movies):


In [20]:
import csv
from itertools import islice

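# Open the tab-separated file; the with-block closes it automatically.
with open('imdb.csv', 'r', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    # islice(reader, 0, 5) yields only the first 5 rows (header included).
    for row in islice(reader, 0, 5):
        print(row)     # the whole row as a list of strings
        print(row[1])  # the second column: Year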


['Title', 'Year', 'Rating', 'Votes']
Year
['!Next?', '1994', '5.4', '5']
1994
['#1 Single', '2006', '6.1', '61']
2006
['#7DaysLater', '2013', '7.1', '14']
2013
['#Bikerlive', '2014', '6.8', '11']
2014
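To ignore the header row, the slice can start at index 1 instead of 0. The following is only a minimal sketch: it reads from a small in-memory string (an assumption, built from the first rows printed above) rather than from 'imdb.csv', so it runs without the data file.

```python
import csv
import io
from itertools import islice

# Stand-in for 'imdb.csv': the header plus the first three rows shown above.
sample = (
    "Title\tYear\tRating\tVotes\n"
    "!Next?\t1994\t5.4\t5\n"
    "#1 Single\t2006\t6.1\t61\n"
    "#7DaysLater\t2013\t7.1\t14\n"
)

reader = csv.reader(io.StringIO(sample), delimiter='\t')

# Start at index 1 to skip the header; stop=None means "until the end".
rows = list(islice(reader, 1, None))

for row in rows:
    print(row[0], row[1])  # title and year, both still strings
```

With the real file, the same islice call would be applied to the reader created from open('imdb.csv').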

There are many ways to do Q1. One way is to use dictionaries where the key: value pairs are:

  • key: year
  • value: a list of movie titles or number of movies

In [2]:
dt = {}
year = 1972
if year not in dt:
    dt[year] = 1
else:
    dt[year] += 1
print(dt)


{1972: 1}

The Counter class from the collections module automates the bookkeeping above: a key that has not been seen yet simply counts as 0.


In [3]:
from collections import Counter

movie_counter = Counter()
movie_counter[1972] += 1
print(movie_counter[1972])
print(movie_counter[1970])


1
0
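Putting the reader and the Counter together might look like the sketch below. It is one possible approach, not the required solution; the in-memory sample (rows copied from tables elsewhere in this lab) stands in for 'imdb.csv', and the variable names are illustrative.

```python
import csv
import io
from collections import Counter
from itertools import islice

# Stand-in for 'imdb.csv' (assumption: same tab-separated layout).
sample = (
    "Title\tYear\tRating\tVotes\n"
    "!Next?\t1994\t5.4\t5\n"
    "#1 Single\t2006\t6.1\t61\n"
    "The Shawshank Redemption\t1994\t9.3\t1511933\n"
    "Pulp Fiction\t1994\t8.9\t1177471\n"
)

movie_counter = Counter()
reader = csv.reader(io.StringIO(sample), delimiter='\t')

for row in islice(reader, 1, None):   # skip the header row
    year = int(row[1])                # convert so years compare as numbers
    movie_counter[year] += 1          # missing keys start at 0

print(movie_counter[1994])
print(movie_counter[2006])
```

Converting the year to int matters later: min() and max() on strings compare alphabetically rather than numerically.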

Once all lines are read, we want to print the dictionary, which can be done by iterating its key: value pairs.


In [4]:
for key,val in dt.items():
    print(key,val)
for key,val in movie_counter.items():
    print(key,val)


1972 1
1972 1

You can get the keys (the years) with the .keys() method.


In [5]:
movie_counter[1980] += 5
movie_counter[2015] += 1
movie_counter.keys()


Out[5]:
dict_keys([1980, 1972, 2015])

Python also has convenient built-in functions, min() and max(), for finding the smallest and largest value of a list or any other iterable.


In [6]:
alist = [23,3,5,4,2,1,1,0,1000]
print(min(alist))
print(max(alist))


0
1000
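Because iterating over a dictionary or Counter yields its keys, min() and max() can be applied to it directly. A small sketch with toy counts (the same years used in the cells above, not the real dataset):

```python
from collections import Counter

# Toy counts only; the real values come from reading 'imdb.csv'.
movie_counter = Counter({1972: 1, 1980: 5, 2015: 1})

first_year = min(movie_counter)   # smallest key
last_year = max(movie_counter)    # largest key
print(first_year, last_year)

# sorted() also iterates over the keys, giving chronological order.
for year in sorted(movie_counter):
    print(year, movie_counter[year])
```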

Code for Q1


In [30]:
import pandas as pd
imdb = pd.read_csv('imdb.csv', delimiter='\t')

In [31]:
imdb.head()


Out[31]:
Title Year Rating Votes
0 !Next? 1994 5.4 5
1 #1 Single 2006 6.1 61
2 #7DaysLater 2013 7.1 14
3 #Bikerlive 2014 6.8 11
4 #ByMySide 2012 5.5 13

In [34]:
min(imdb['Year'])


Out[34]:
1874

In [35]:
max(imdb['Year'])


Out[35]:
2017

In [48]:
from collections import Counter
Counter(imdb["Year"])


Out[48]:
Counter({1874: 1,
         1878: 1,
         1887: 1,
         1888: 5,
         1889: 2,
         1890: 5,
         1891: 9,
         1892: 9,
         1893: 2,
         1894: 94,
         1895: 116,
         1896: 678,
         1897: 479,
         1898: 321,
         1899: 242,
         1900: 265,
         1901: 254,
         1902: 217,
         1903: 261,
         1904: 214,
         1905: 177,
         1906: 182,
         1907: 197,
         1908: 267,
         1909: 405,
         1910: 389,
         1911: 309,
         1912: 376,
         1913: 311,
         1914: 315,
         1915: 361,
         1916: 328,
         1917: 317,
         1918: 286,
         1919: 313,
         1920: 323,
         1921: 345,
         1922: 328,
         1923: 393,
         1924: 466,
         1925: 508,
         1926: 554,
         1927: 581,
         1928: 609,
         1929: 671,
         1930: 836,
         1931: 939,
         1932: 1026,
         1933: 1024,
         1934: 1120,
         1935: 1174,
         1936: 1235,
         1937: 1245,
         1938: 1230,
         1939: 1162,
         1940: 1160,
         1941: 1169,
         1942: 1193,
         1943: 1105,
         1944: 969,
         1945: 876,
         1946: 952,
         1947: 1010,
         1948: 1084,
         1949: 1208,
         1950: 1283,
         1951: 1318,
         1952: 1316,
         1953: 1393,
         1954: 1397,
         1955: 1476,
         1956: 1479,
         1957: 1604,
         1958: 1533,
         1959: 1572,
         1960: 1567,
         1961: 1623,
         1962: 1669,
         1963: 1635,
         1964: 1823,
         1965: 1896,
         1966: 2025,
         1967: 2086,
         1968: 2199,
         1969: 2320,
         1970: 2240,
         1971: 2370,
         1972: 2445,
         1973: 2325,
         1974: 2392,
         1975: 2286,
         1976: 2399,
         1977: 2264,
         1978: 2386,
         1979: 2526,
         1980: 2438,
         1981: 2485,
         1982: 2537,
         1983: 2647,
         1984: 2779,
         1985: 2908,
         1986: 2882,
         1987: 3049,
         1988: 3054,
         1989: 3193,
         1990: 3093,
         1991: 2993,
         1992: 3136,
         1993: 3128,
         1994: 3415,
         1995: 3698,
         1996: 3923,
         1997: 4353,
         1998: 4651,
         1999: 5138,
         2000: 5575,
         2001: 6042,
         2002: 6694,
         2003: 7355,
         2004: 8584,
         2005: 9508,
         2006: 10115,
         2007: 10147,
         2008: 11095,
         2009: 12268,
         2010: 12931,
         2011: 13944,
         2012: 13887,
         2013: 13048,
         2014: 10862,
         2015: 4402,
         2016: 2,
         2017: 1})

Q2: What are the average rating and the average number of votes?

We can store the ratings/votes column as a list and then calculate various basic statistics (mean, median, etc.). To do this, we can use the NumPy library and call the functions numpy.mean and numpy.median. For example,


In [10]:
import numpy as np

alist = [1,3,6,2,5,2]
print(np.mean(alist))
print(np.median(alist))
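One detail to keep in mind when building these lists with csv.reader: every field arrives as a string, so it has to be converted to a number first. The sketch below illustrates this with the standard-library statistics module instead of NumPy (a deliberate swap so the snippet has no extra dependencies); the values are the ratings and votes of the first four rows shown earlier.

```python
import statistics

# Fields as csv.reader returns them: strings, not numbers.
rating_strings = ['5.4', '6.1', '7.1', '6.8']
vote_strings = ['5', '61', '14', '11']

# Convert before doing any arithmetic.
ratings = [float(r) for r in rating_strings]
votes = [int(v) for v in vote_strings]

mean_rating = statistics.mean(ratings)
median_votes = statistics.median(votes)

print(mean_rating)
print(median_votes)
```

np.mean(ratings) and np.median(votes) accept the same converted lists.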

Code for Q2


In [41]:
# implement below
imdb['Rating'].mean()


Out[41]:
6.2961953413777811

In [42]:
imdb['Votes'].mean()


Out[42]:
1691.2317746021706

Q3: What are the 5 movies that have the highest ratings/votes?

Store the movie titles and ratings as a dictionary:

  • key: movie title
  • value: movie rating

Then we can sort the dictionary's items by value, which returns a list of (key, value) tuples. Remember to print only the top 5 movies.


In [12]:
import operator

dt = {1971: 2, 1975: 10, 1962: 1, 1980: 50, 1981: 55}
sorted_x_by_val = sorted(dt.items(), key=operator.itemgetter(1), reverse=True )
print(sorted_x_by_val)
for elem in sorted_x_by_val:
    print(elem[0],elem[1])


[(1981, 55), (1980, 50), (1975, 10), (1971, 2), (1962, 1)]
1981 55
1980 50
1975 10
1971 2
1962 1
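To keep only the top entries, the sorted list can be sliced. The sketch below reuses the toy dictionary from the cell above and also shows heapq.nlargest, a standard-library alternative that avoids sorting the whole list; n = 2 is an arbitrary choice for the illustration.

```python
import heapq
import operator

dt = {1971: 2, 1975: 10, 1962: 1, 1980: 50, 1981: 55}

# Option 1: sort everything, then slice off the first n tuples.
sorted_by_val = sorted(dt.items(), key=operator.itemgetter(1), reverse=True)
top_two = sorted_by_val[:2]

# Option 2: ask directly for the n largest items by value.
top_two_heap = heapq.nlargest(2, dt.items(), key=operator.itemgetter(1))

for key, val in top_two:
    print(key, val)
```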

Code for Q3


In [45]:
# implement below
# sort_values replaces the deprecated sort_index(by=...) form
imdb.sort_values(by='Rating', ascending=False).head()


Out[45]:
Title Year Rating Votes
57863 Adolfo Perez Esquivel: Rivers of Hope 2015 9.9 9
42123 The Red Shirt Diaries 2014 9.8 6
140553 High-Rise 2015 9.8 5
131241 Girls Loving Girls 1996 9.8 5
24902 Mari White Presents the Newsboys 2011 9.7 6

In [47]:
imdb.sort_values(by='Votes', ascending=False).head()


Out[47]:
Title Year Rating Votes
279320 The Shawshank Redemption 1994 9.3 1511933
264590 The Dark Knight 2008 9.0 1487023
149895 Inception 2010 8.8 1285905
122656 Fight Club 1999 8.9 1189053
223981 Pulp Fiction 1994 8.9 1177471

Name the .ipynb file 'lab02_lastname_firstname' and upload it to Canvas under [w2] lab assignment.

Part 2. HTML and CSS

1. Set up a local web server

Many browsers don't allow loading files locally due to security concerns. We can get around this by running a local web server with Python:

  1. Open the 'Command Prompt'.
  2. Move to the folder where you keep your lab materials by typing 'cd FOLDER_LOCATION'. We will use this folder as the 'root' for our web server.
  3. Then type 'python -m http.server'. (This is the Python 3 command; in Python 2 the equivalent module is called SimpleHTTPServer.)

If successful, you'll see

Serving HTTP on 0.0.0.0 port 8000 …

This means that your computer is now running a web server that listens on port 8000; the address 0.0.0.0 means it accepts connections on all of the machine's network interfaces. Now you can open a browser and type "localhost:8000" in the address bar to connect to this web server (on some systems "0.0.0.0:8000" works as well). You should see a listing of the files in the folder; click on the different links to open them. You can also open a file directly by typing 'localhost:8000/NAME_OF_YOUR_FILE.html' in the address bar.
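The same server can also be started from a Python script. The sketch below is an illustration rather than part of the lab: it serves the current folder on a port chosen by the operating system (port 0 is an assumption made here to avoid clashing with a server already on 8000), requests the root page the way a browser would, and then shuts down.

```python
import http.client
import http.server
import threading

# Port 0 lets the operating system pick any free port.
server = http.server.HTTPServer(
    ("127.0.0.1", 0), http.server.SimpleHTTPRequestHandler
)
port = server.server_address[1]

# Serve in a background thread so this script can also play the browser.
thread = threading.Thread(target=server.serve_forever, daemon=True)
thread.start()

# Ask for the root page, like typing localhost:<port> in the address bar.
conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
conn.request("GET", "/")
response = conn.getresponse()
status = response.status
response.read()   # consume the folder listing
conn.close()

server.shutdown()
server.server_close()
print(port, status)
```

For the lab itself the command-line form is simpler, since the server has to keep running while you edit your pages.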

2. HTML review

Webpages are written in a standard markup language called HTML (HyperText Markup Language). The basic syntax of HTML consists of tags: element names enclosed within '<' and '>' symbols. Browsers such as Firefox and Chrome parse these tags and display the content of a webpage in the designated format. This is called rendering.

Here is a list of important tags and their descriptions.

  • html - Surrounds the entire document.
  • head - Contains info about the document itself. E.g. the title, any external stylesheets or scripts, etc.
  • title - Assigns title to page. This title is used while bookmarking.
  • body - The main part of the document.
  • h1, h2, h3, h4, h5, h6 - Headings (the smaller the number, the larger the heading).
  • p - Paragraph.
  • br - Line break.
  • em - emphasize text.
  • strong or b - Bold font.
  • a - Defines a hyperlink and allows you to link out to the other webpages.
  • img - Place an image.
  • ul, ol, li - Unordered lists with bullets, ordered lists with numbers and each item in list respectively.
  • table, th, td, tr - Make a table, specifying contents of each cell.
  • <!-- ... --> - Comments; the text between the markers will not be displayed.
  • span - This will not visibly change anything on the webpage, but it is important when referencing elements in CSS or JavaScript. It spans a section of text, say, within a paragraph.
  • div - This will not visibly change anything on the webpage. But it is important while referencing in CSS or JavaScript. It stands for division and allocates a section of a page.
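To see several of these tags working together, the sketch below holds a tiny page in a Python string and uses the standard-library html.parser module to list the tags it contains. All content in the page is placeholder text made up for this illustration, and the class name TagCollector is hypothetical.

```python
from html.parser import HTMLParser

# Placeholder page: every heading, item and cell is made-up example text.
page = """<!DOCTYPE html>
<html>
  <head>
    <title>Tag demo</title>
  </head>
  <body>
    <!-- This comment is not displayed -->
    <h1>A heading</h1>
    <p>A paragraph with <em>emphasis</em>, <strong>bold text</strong>
       and a <a href="https://www.imdb.com/">link</a>.</p>
    <ul>
      <li>First item</li>
      <li>Second item</li>
    </ul>
    <table>
      <tr><th>Column A</th><th>Column B</th></tr>
      <tr><td>cell 1</td><td>cell 2</td></tr>
    </table>
  </body>
</html>
"""


class TagCollector(HTMLParser):
    """Record the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(page)
print(collector.tags)
```

Saving the string to a .html file in the server's root folder lets you view the rendered result at localhost:8000.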

Using the 5 movies with the most votes found in Part 1, try the following:

  1. Create a table with the following columns: Movie Title, Year, Rating, Votes.
  2. Link each movie title to its IMDb page.
  3. Add a title for the table. Can you change its font and set it to bold?
  4. Change the background color of the page.
  5. Add an entry of your favorite movie to the table. Can you set the text to a different color to highlight it?

Test your code by visiting the web page on your local server. Name the .html file 'lab02_html_lastname_firstname' and upload it to Canvas.

3. CSS review

While HTML directly deals with the content and structure, CSS (Cascading Style Sheets) is the primary language that is used for the look and formatting of a web document.

A CSS stylesheet consists of one or more selectors, properties and values. For example:

body {   
    background-color: white;   
    color: steelblue;   
}

Selectors are the HTML elements to which the specific styles (combination of properties and values) will be applied. In the above example, all text within the ‘body’ tags will be in steelblue.
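The selector / property / value structure can be made explicit with a small Python helper that renders CSS text from nested lists. This is purely an illustration of the syntax; render_css is a hypothetical function written for this sketch, not part of any library.

```python
def render_css(rules):
    """Turn [(selector, [(property, value), ...]), ...] into CSS text."""
    blocks = []
    for selector, declarations in rules:
        # One "property: value;" line per declaration.
        lines = ["    {}: {};".format(prop, value) for prop, value in declarations]
        # Wrap the declarations in "selector { ... }".
        blocks.append("{} {{\n{}\n}}".format(selector, "\n".join(lines)))
    return "\n\n".join(blocks)


# The same rule as the example above: one selector, two declarations.
rules = [
    ("body", [("background-color", "white"), ("color", "steelblue")]),
]

css_text = render_css(rules)
print(css_text)
```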

There are three ways to include CSS code in HTML. This is called ‘referencing’.

Embed CSS in HTML - You can place the CSS code within 'style' tags inside the 'head' tags. This keeps everything in a single HTML file, but it makes that file longer.

<head>
  <style type="text/css">
    .description {
      font: 16px "Times New Roman", serif;
    }
    .viz {
      font: 10px sans-serif;
    }
  </style>
</head>

Reference an external stylesheet from HTML - This is a much cleaner way but results in the creation of another file. To do this, you can copy the CSS code into a text file and save it as a ‘.css’ file in the same folder as the HTML file. In the document head in the HTML code, you can then do the following:

<head>
  <link rel="stylesheet" href="stylesheet.css">
</head>

Attach inline styles - You can also directly attach the styles in-line along with the main HTML code in the body. This makes it easy to customize specific elements but makes the code very messy - the design and content get mixed up.

<p style="color: green; font-size:36px; font-weight:bold;">
  Inline styles can help when using D3.
</p>

Can you redo tasks 3-5 from the previous section using only CSS? Name the .ipynb file 'lab02_css_lastname_firstname' and upload it to Canvas.

